Conversation

@ikawrakow
Owner

The implementation assumed that the K and V caches are contiguous and used this assumption to dequantize to fp16. This is certainly wrong for the V cache, which is just a view of the K cache with rows of 512 instead of 576 elements.
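To illustrate the problem (a minimal sketch, not the code in this PR): the ne/nb fields below mirror ggml's tensor-layout convention, and dequantize_row_to_f16() is a hypothetical stand-in for the type-specific quantized-to-fp16 conversion.

```cpp
#include <cstdint>
#include <cstddef>

// Simplified 2D tensor view in the spirit of ggml: ne[0] elements per row,
// ne[1] rows, nb[1] byte stride between consecutive rows.
struct tensor_view {
    int64_t      ne[2];
    size_t       nb[2];
    const void * data;
};

// Placeholder for the real type-specific quantized -> fp16 row conversion.
static void dequantize_row_to_f16(const void * /*src*/, uint16_t * dst, int64_t n_elems) {
    for (int64_t i = 0; i < n_elems; ++i) dst[i] = 0; // stand-in only
}

// Wrong for a view: treats all rows as one packed blob. This only works when
// the rows really are back-to-back in memory (as for the K cache).
static void dequantize_assuming_contiguous(const tensor_view & t, uint16_t * dst) {
    dequantize_row_to_f16(t.data, dst, t.ne[0] * t.ne[1]);
}

// Correct for a strided view: the V cache has 512-element rows, but consecutive
// rows are nb[1] bytes apart because they live inside 576-element K rows.
static void dequantize_strided(const tensor_view & t, uint16_t * dst) {
    for (int64_t row = 0; row < t.ne[1]; ++row) {
        const char * src_row = (const char *) t.data + row * t.nb[1];
        dequantize_row_to_f16(src_row, dst + row * t.ne[0], t.ne[0]);
    }
}
```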

@JohannesGaessler You may want to take a look at this PR. I don't think your PR in mainline llama.cpp can work for DeepSeek models with quantized KV cache.

A test session with this model:

./bin/llama-cli -m ./ds2.5/DeepSeek-V2.5-1210-IQ3_XXS-00001-of-00003.gguf -t 32 -ngl 100 -mla 3 -fa -c 32768 -s 1234 -ot exps=CPU -cnv -ctk q8_0 -ctv q8_0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes Log start main: build = 3673 (4084ca73) main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu main: seed = 1234 llama_model_loader: additional 2 GGUFs metadata loaded. llama_model_loader: loaded meta data with 53 key-value pairs and 959 tensors from ./ds2.5/DeepSeek-V2.5-1210-IQ3_XXS-00001-of-00003.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = deepseek2 llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = DeepSeek V2.5 1210 llama_model_loader: - kv 3: general.version str = V2.5-1210 llama_model_loader: - kv 4: general.basename str = DeepSeek llama_model_loader: - kv 5: general.size_label str = 160x14B llama_model_loader: - kv 6: general.license str = other llama_model_loader: - kv 7: general.license.name str = deepseek llama_model_loader: - kv 8: general.license.link str = https://github.com/deepseek-ai/DeepSe... llama_model_loader: - kv 9: deepseek2.block_count u32 = 60 llama_model_loader: - kv 10: deepseek2.context_length u32 = 163840 llama_model_loader: - kv 11: deepseek2.embedding_length u32 = 5120 llama_model_loader: - kv 12: deepseek2.feed_forward_length u32 = 12288 llama_model_loader: - kv 13: deepseek2.attention.head_count u32 = 128 llama_model_loader: - kv 14: deepseek2.attention.head_count_kv u32 = 128 llama_model_loader: - kv 15: deepseek2.rope.freq_base f32 = 10000,000000 llama_model_loader: - kv 16: deepseek2.attention.layer_norm_rms_epsilon f32 = 0,000001 llama_model_loader: - kv 17: deepseek2.expert_used_count u32 = 6 llama_model_loader: - kv 18: general.file_type u32 = 23 llama_model_loader: - kv 19: deepseek2.leading_dense_block_count u32 = 1 llama_model_loader: - kv 20: deepseek2.vocab_size u32 = 102400 llama_model_loader: - kv 21: deepseek2.attention.q_lora_rank u32 = 1536 llama_model_loader: - kv 22: deepseek2.attention.kv_lora_rank u32 = 512 llama_model_loader: - kv 23: deepseek2.attention.key_length u32 = 192 llama_model_loader: - kv 24: deepseek2.attention.value_length u32 = 128 llama_model_loader: - kv 25: deepseek2.expert_feed_forward_length u32 = 1536 llama_model_loader: - kv 26: deepseek2.expert_count u32 = 160 llama_model_loader: - kv 27: deepseek2.expert_shared_count u32 = 2 llama_model_loader: - kv 28: deepseek2.expert_weights_scale f32 = 16,000000 llama_model_loader: - kv 29: deepseek2.rope.dimension_count u32 = 64 llama_model_loader: - kv 30: deepseek2.rope.scaling.type str = yarn llama_model_loader: - kv 31: deepseek2.rope.scaling.factor f32 = 40,000000 llama_model_loader: - kv 32: deepseek2.rope.scaling.original_context_length u32 = 4096 llama_model_loader: - kv 33: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0,100000 llama_model_loader: - kv 34: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 35: tokenizer.ggml.pre str = deepseek-llm llama_model_loader: - kv 36: tokenizer.ggml.tokens arr[str,102400] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 37: tokenizer.ggml.token_type arr[i32,102400] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 38: tokenizer.ggml.merges arr[str,99757] = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e... 
llama_model_loader: - kv 39: tokenizer.ggml.bos_token_id u32 = 100000 llama_model_loader: - kv 40: tokenizer.ggml.eos_token_id u32 = 100001 llama_model_loader: - kv 41: tokenizer.ggml.padding_token_id u32 = 100001 llama_model_loader: - kv 42: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 43: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - kv 44: tokenizer.chat_template str = {% if not add_generation_prompt is de... llama_model_loader: - kv 45: general.quantization_version u32 = 2 llama_model_loader: - kv 46: quantize.imatrix.file str = /models_out/DeepSeek-V2.5-1210-GGUF/D... llama_model_loader: - kv 47: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt llama_model_loader: - kv 48: quantize.imatrix.entries_count i32 = 716 llama_model_loader: - kv 49: quantize.imatrix.chunks_count i32 = 139 llama_model_loader: - kv 50: split.no u16 = 0 llama_model_loader: - kv 51: split.count u16 = 3 llama_model_loader: - kv 52: split.tensors.count i32 = 959 llama_model_loader: - type f32: 300 tensors llama_model_loader: - type q5_K: 1 tensors llama_model_loader: - type iq3_xxs: 597 tensors llama_model_loader: - type iq3_s: 61 tensors llm_load_vocab: special tokens cache size = 18 llm_load_vocab: token to piece cache size = 0,6411 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = deepseek2 llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 102400 llm_load_print_meta: n_merges = 99757 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 163840 llm_load_print_meta: n_embd = 5120 llm_load_print_meta: n_layer = 60 llm_load_print_meta: n_head = 128 llm_load_print_meta: n_head_kv = 128 llm_load_print_meta: n_rot = 64 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_swa_pattern = 1 llm_load_print_meta: n_embd_head_k = 192 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 1 llm_load_print_meta: n_embd_k_gqa = 24576 llm_load_print_meta: n_embd_v_gqa = 16384 llm_load_print_meta: f_norm_eps = 0,0e+00 llm_load_print_meta: f_norm_rms_eps = 1,0e-06 llm_load_print_meta: f_clamp_kqv = 0,0e+00 llm_load_print_meta: f_max_alibi_bias = 0,0e+00 llm_load_print_meta: f_logit_scale = 0,0e+00 llm_load_print_meta: n_ff = 12288 llm_load_print_meta: n_expert = 160 llm_load_print_meta: n_expert_used = 6 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = yarn llm_load_print_meta: freq_base_train = 10000,0 llm_load_print_meta: freq_scale_train = 0,025 llm_load_print_meta: n_ctx_orig_yarn = 4096 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 236B llm_load_print_meta: model ftype = IQ3_XXS - 3.0625 bpw llm_load_print_meta: model params = 235,741 B llm_load_print_meta: model size = 84,604 GiB (3,083 BPW) llm_load_print_meta: repeating layers = 84,058 GiB (3,077 BPW, 234,693 B parameters) llm_load_print_meta: general.name = DeepSeek V2.5 1210 llm_load_print_meta: BOS token = 100000 '<|begin▁of▁sentence|>' llm_load_print_meta: EOS token = 100001 '<|end▁of▁sentence|>' llm_load_print_meta: PAD token = 100001 '<|end▁of▁sentence|>' llm_load_print_meta: LF token = 126 'Ä' llm_load_print_meta: max token length = 256 llm_load_print_meta: n_layer_dense_lead = 1 llm_load_print_meta: n_lora_q = 1536 llm_load_print_meta: n_lora_kv = 512 
llm_load_print_meta: n_ff_exp = 1536 llm_load_print_meta: n_expert_shared = 2 llm_load_print_meta: expert_weights_scale = 16,0 llm_load_print_meta: expert_weights_norm = 0 llm_load_print_meta: expert_gating_func = softmax llm_load_print_meta: rope_yarn_log_mul = 0,1000 llm_load_tensors: ggml ctx size = 0,80 MiB Tensor blk.1.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.1.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.1.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.2.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.2.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.2.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.3.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.3.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.3.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.4.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.4.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.4.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.5.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.5.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.5.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.6.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.6.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.6.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.7.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.7.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.7.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.8.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.8.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.8.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.9.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.9.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.9.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.10.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.10.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.10.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.11.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.11.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.11.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.12.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.12.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.12.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.13.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.13.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.13.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.14.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.14.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.14.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.15.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.15.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.15.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.16.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.16.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.16.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.17.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.17.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.17.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.18.ffn_gate_exps.weight 
buffer type overriden to CPU Tensor blk.18.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.18.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.19.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.19.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.19.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.20.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.20.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.20.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.21.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.21.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.21.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.22.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.22.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.22.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.23.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.23.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.23.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.24.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.24.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.24.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.25.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.25.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.25.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.26.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.26.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.26.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.27.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.27.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.27.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.28.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.28.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.28.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.29.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.29.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.29.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.30.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.30.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.30.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.31.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.31.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.31.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.32.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.32.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.32.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.33.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.33.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.33.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.34.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.34.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.34.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.35.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.35.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.35.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.36.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.36.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.36.ffn_up_exps.weight buffer 
type overriden to CPU Tensor blk.37.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.37.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.37.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.38.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.38.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.38.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.39.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.39.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.39.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.40.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.40.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.40.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.41.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.41.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.41.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.42.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.42.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.42.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.43.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.43.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.43.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.44.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.44.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.44.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.45.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.45.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.45.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.46.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.46.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.46.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.47.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.47.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.47.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.48.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.48.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.48.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.49.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.49.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.49.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.50.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.50.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.50.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.51.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.51.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.51.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.52.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.52.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.52.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.53.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.53.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.53.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.54.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.54.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.54.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.55.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.55.ffn_down_exps.weight buffer type 
overriden to CPU Tensor blk.55.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.56.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.56.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.56.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.57.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.57.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.57.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.58.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.58.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.58.ffn_up_exps.weight buffer type overriden to CPU Tensor blk.59.ffn_gate_exps.weight buffer type overriden to CPU Tensor blk.59.ffn_down_exps.weight buffer type overriden to CPU Tensor blk.59.ffn_up_exps.weight buffer type overriden to CPU llm_load_tensors: offloading 60 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 61/61 layers to GPU llm_load_tensors: CPU buffer size = 37343,30 MiB llm_load_tensors: CPU buffer size = 37866,68 MiB llm_load_tensors: CPU buffer size = 10656,64 MiB llm_load_tensors: CPU buffer size = 214,84 MiB llm_load_tensors: CUDA0 buffer size = 5109,97 MiB .................................................................................................... ============ llm_load_tensors: need to compute 60 wk_b tensors Computed blk.0.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.1.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.2.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.3.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.4.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.5.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.6.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.7.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.8.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.9.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.10.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.11.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.12.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.13.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.14.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.15.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.16.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.17.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.18.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.19.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.20.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.21.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.22.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.23.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.24.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.25.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.26.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.27.attn_v_b.weight as 128 x 512 x 128 
and stored in buffer CUDA0 Computed blk.28.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.29.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.30.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.31.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.32.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.33.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.34.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.35.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.36.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.37.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.38.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.39.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.40.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.41.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.42.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.43.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.44.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.45.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.46.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.47.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.48.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.49.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.50.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.51.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.52.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.53.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.54.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.55.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.56.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.57.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.58.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 Computed blk.59.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CUDA0 llama_new_context_with_model: n_ctx = 32768 llama_new_context_with_model: n_batch = 2048 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 1 llama_new_context_with_model: mla_attn = 3 llama_new_context_with_model: attn_max_b = 0 llama_new_context_with_model: fused_moe = 0 llama_new_context_with_model: ser = -1, 0 llama_new_context_with_model: freq_base = 10000,0 llama_new_context_with_model: freq_scale = 0,025 llama_kv_cache_init: layer 0: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 1: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 2: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 3: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 4: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 5: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 6: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 7: n_embd_head_qk_rope = 
64, kv_lora_rank = 512 llama_kv_cache_init: layer 8: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 9: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 10: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 11: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 12: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 13: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 14: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 15: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 16: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 17: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 18: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 19: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 20: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 21: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 22: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 23: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 24: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 25: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 26: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 27: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 28: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 29: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 30: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 31: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 32: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 33: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 34: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 35: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 36: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 37: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 38: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 39: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 40: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 41: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 42: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 43: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 44: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 45: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 46: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 47: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 48: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 49: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 50: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 51: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 52: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 53: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 54: 
n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 55: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 56: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 57: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 58: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: layer 59: n_embd_head_qk_rope = 64, kv_lora_rank = 512 llama_kv_cache_init: CUDA0 KV buffer size = 1147,53 MiB llama_new_context_with_model: KV self size = 1147,50 MiB, c^KV (q8_0): 1147,50 MiB, kv^T: not used llama_new_context_with_model: CUDA_Host output buffer size = 0,39 MiB llama_new_context_with_model: CUDA0 compute buffer size = 6346,00 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 74,01 MiB llama_new_context_with_model: graph nodes = 3350 llama_new_context_with_model: graph splits = 179 main: chat template example: You are a helpful assistant

<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>

system_info: n_threads = 32 / 64 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
main: interactive mode on.
sampling:
repeat_last_n = 64, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000
top_k = 40, tfs_z = 1,000, top_p = 0,950, min_p = 0,050, typical_p = 1,000, temp = 0,800
mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 32768, n_batch = 2048, n_predict = -1, n_keep = 1

== Running in interactive mode. ==

  • Press Ctrl+C to interject at any time.
  • Press Return to return control to the AI.
  • To return control without starting a new line, end your input with '/'.
  • If you want to submit another line, end your input with '\'.
> Count from 1 to 10 in French

Sure! Here's how you count from 1 to 10 in French:

  1. Un
  2. Deux
  3. Trois
  4. Quatre
  5. Cinq
  6. Six
  7. Sept
  8. Huit
  9. Neuf
  10. Dix

> Tell me everything you know about radiation therapy of cancer

Radiation therapy, also known as radiotherapy, is a medical treatment that uses high doses of radiation to kill cancer cells and shrink tumors. Here’s an overview of everything you need to know about radiation therapy for cancer:

Types of Radiation Therapy

  1. External Beam Radiation Therapy (EBRT):

    • Description: Radiation is delivered from a machine outside the body, targeting the cancerous tumor.
    • Common Techniques:
      • 3D Conformal Radiation Therapy: Uses multiple radiation beams to shape the treatment to the tumor’s 3D shape.
      • Intensity-Modulated Radiation Therapy (IMRT): Delivers varying doses of radiation to different parts of the tumor, reducing damage to nearby healthy tissues.
      • Proton Therapy: Uses protons instead of X-rays, allowing for precise targeting of the tumor with minimal radiation to surrounding tissues.
      • Stereotactic Radiosurgery (SRS) and Stereotactic Body Radiation Therapy (SBRT): High-precision techniques used for small tumors or lesions, often in the brain or lung.
  2. Internal Radiation Therapy (Brachytherapy):

    • Description: Radioactive sources are placed inside the body, either temporarily or permanently, directly into or near the tumor.
    • Types:
      • High Dose Rate (HDR) Brachytherapy: Temporary placement of radioactive material for a short period.
      • Low Dose Rate (LDR) Brachytherapy: Permanent placement of radioactive seeds, commonly used for prostate cancer.
  3. Systemic Radiation Therapy:

    • Description: Radioactive substances are administered through the bloodstream, targeting cancer cells throughout the body.
    • Examples:
      • Radioactive iodine (I-131) for thyroid cancer.
      • Lutetium-177 (Lu-177) or Yttrium-90 (Y-90) for neuroendocrine tumors.

Purpose of Radiation Therapy

  1. Cancer Treatment:

    • Curative Intent: To eliminate the cancer completely, often used in early-stage cancers.
    • Palliative Treatment: To relieve symptoms and improve quality of life for advanced-stage cancers.
    • Adjuvant Therapy: Used after surgery to eliminate any remaining cancer cells.
    • Neoadjuvant Therapy: Used before surgery to shrink the tumor, making surgery easier and potentially reducing the extent of surgery needed.
  2. Prevention of Recurrence:

    • Radiation may be used to reduce the risk of cancer recurrence in high-risk patients.

Side Effects of Radiation Therapy

  1. Acute Side Effects:

    • Skin Reactions: Redness, irritation, and peeling.
    • Fatigue: Common and often temporary.
    • Swelling or Edema: Due to fluid accumulation in treated areas.
    • Mucositis: Inflammation of the mucous membranes, common in head and neck cancers.
  2. Late Side Effects:

    • Tissue Fibrosis: Scarring of the irradiated tissues.
    • Lymphedema: Swelling due to damage to the lymphatic system.
    • Cognitive Changes: Rare, but possible in brain radiation.
    • Secondary Cancers: Extremely rare but possible years after treatment.

Factors Influencing Radiation Therapy

  1. Tumor Characteristics:

    • Type, size, and location of the tumor.
    • Whether the tumor is localized or has spread (metastatic).
  2. Patient Characteristics:

    • Age and overall health.
    • Previous cancer treatments and their outcomes.
    • Genetic factors and cancer predisposition.
  3. Radiation Techniques and Doses:

    • The type of radiation used and the dose administered.
    • The fractionation schedule (how the dose is divided over time).

Advances in Radiation Therapy

  1. Imaging and Targeting:

    • Advanced imaging techniques like PET scans help in precisely targeting the tumor.
    • Image-guided radiation therapy (IGRT) uses real-time imaging to adjust the radiation beam during treatment.
  2. Personalized Medicine:

    • Customized treatment plans based on the individual patient’s tumor characteristics and genetics.
  3. Integration with Other Treatments:

    • Combination with chemotherapy, immunotherapy, or surgery for better outcomes.

Conclusion

Radiation therapy is a powerful tool in the fight against cancer, offering curative potential, symptom relief, and preventive measures. Advances in technology continue to improve its precision and reduce side effects, making it an essential component of modern cancer care.

> How many r's are there in strawberry

The word "strawberry" contains 3 r's.

> Can you solve the following system of equations for me? x + y = 2 x - y = 3

Certainly! Let's solve the system of equations step by step:

We are given two equations:

  1. ( x + y = 2 )
  2. ( x - y = 3 )

Step 1: Add the two equations

Adding the left-hand sides:
[ (x + y) + (x - y) = 2 + 3 ]
[ 2x = 5 ]

Step 2: Solve for ( x )

[ 2x = 5 ]
[ x = \frac{5}{2} ]
[ x = 2.5 ]

Step 3: Substitute ( x = 2.5 ) into one of the original equations to find ( y )

Let's use the first equation:
[ x + y = 2 ]
[ 2.5 + y = 2 ]
[ y = 2 - 2.5 ]
[ y = -0.5 ]

Final Solution

The solution to the system of equations is:
[ x = 2.5 ]
[ y = -0.5 ]

> Can you explain the German electoral system?

Certainly! The German electoral system is characterized by a mixed-member proportional representation (MMP) system, which combines elements of both constituency-based voting and proportional representation. Here’s an overview of how it works:

Electoral System Overview

  1. Electoral Districts:

    • Germany is divided into 299 single-member constituencies (Wahlkreise), each represented by one directly elected Member of Parliament (MP).
    • Voters cast two votes: one for a candidate in their local constituency (Erststimme or first vote) and one for a political party (Zweitstimme or second vote).
  2. Directly Elected MPs (Erststimme):

    • The candidate with the most votes in each constituency wins a seat in the Bundestag (Federal Parliament).
  3. Party List MPs (Zweitstimme):

    • The second vote determines the proportional representation of parties in the Bundestag.
    • Each party’s share of the total second votes determines how many additional seats they receive beyond the directly elected MPs.

Seat Allocation Process

  1. Initial Allocation:

    • Directly elected MPs from the constituencies are first seated.
    • The remaining seats are filled based on the proportion of second votes each party receives.
  2. Overhang Seats:

    • If a party wins significantly more constituency seats than its share of the second votes would allow, additional "overhang" seats are added to maintain proportionality.
  3. Compensatory Seats:

    • To ensure overall proportionality, compensatory seats are allocated to parties that did not win enough constituency seats to match their share of the second votes.

Formation of the Bundestag

  1. Minimum Threshold:

    • Parties must receive at least 5% of the second votes nationwide or win at least three constituency seats to enter the Bundestag.
  2. Government Formation:

    • The party or coalition with the majority of seats forms the government.
    • If no single party has a majority, a coalition government is formed between two or more parties.

Key Features

  1. Stability:

    • The system promotes stable government formation by allowing for coalition-building among parties with similar political platforms.
  2. Representation:

    • Proportional representation ensures that smaller parties have a voice in the Bundestag, reflecting the diversity of political views in the electorate.
  3. Direct Representation:

    • Voters have a personal connection to their directly elected MP, enhancing local representation.

Conclusion

The German electoral system is designed to balance local representation with proportional representation of political parties, ensuring a stable and representative government. This hybrid system has been successful in maintaining political stability while reflecting the diverse political landscape of Germany.

> Can you find the encoding rule for the following example: Encoded text: oyfjdnisdr rtqwainr acxz mynzbhhx Decoded text: Think step by step

To decode the encoded text "oyfjdnisdr rtqwainr acxz mynzbhhx" to "Think step by step," we need to find the encoding rule. Given the nature of the encoded text and the decoded text, it seems likely that the encoding involves a simple substitution cipher, where each letter in the plaintext is replaced by a different letter.

Let's break down the encoded text and compare it with the decoded text:

Encoded Text:
oyfjdnisdr rtqwainr acxz mynzbhhx

Decoded Text:
Think step by step

Step 1: Analyze the Lengths

  • "Think step by step" has 16 characters.
  • "oyfjdnisdr rtqwainr acxz mynzbhhx" has 32 characters.

This suggests that each letter in the decoded text corresponds to two letters in the encoded text.

Step 2: Identify Letter Pairs

Let's align the letters:

Decoded Text:
T h i n k s t e p b y s t e p

Encoded Text:
o y f j d n i s d r r t q w a i n r a c x z m y n z b h h x

Step 3: Match Letters

Now, let's match each pair of letters from the encoded text to a single letter in the decoded text:

  • T → oy
  • h → fj
  • i → dn
  • n → is
  • k → dr
  • s → rt
  • t → qw
  • e → ai
  • p → nr
  • b → ac
  • y → xz
  • s → my
  • t → nz
  • e → bh
  • p → hx

Step 4: Formulate the Encoding Rule

From the above matching, we can infer that the encoding rule is:

  • Each letter in the decoded text corresponds to two letters in the encoded text.
  • The mapping is not a simple substitution but rather a pairing of letters.

Conclusion

The encoding rule for this example is that each letter in the decoded text is represented by a pair of letters in the encoded text. The specific mapping can be derived from the given example:

  • T → oy
  • h → fj
  • i → dn
  • n → is
  • k → dr
  • s → rt
  • t → qw
  • e → ai
  • p → nr
  • b → ac
  • y → xz
  • s → my
  • t → nz
  • e → bh
  • p → hx

This rule can be used to decode other similarly encoded texts.

llama_print_timings: load time = 11641,32 ms
llama_print_timings: sample time = 135,78 ms / 2822 runs ( 0,05 ms per token, 20782,86 tokens per second)
llama_print_timings: prompt eval time = 24467,67 ms / 126 tokens ( 194,19 ms per token, 5,15 tokens per second)
llama_print_timings: eval time = 224627,46 ms / 2816 runs ( 79,77 ms per token, 12,54 tokens per second)
llama_print_timings: total time = 517022,50 ms / 2942 tokens

Here is a quick sweep-bench performance test.

fp16 KV cache

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 2048 | 512 | 0 | 14.243 | 143.79 | 39.607 | 12.93 |
| 2048 | 512 | 2048 | 14.741 | 138.93 | 40.155 | 12.75 |
| 2048 | 512 | 4096 | 15.250 | 134.29 | 40.546 | 12.63 |
| 2048 | 512 | 6144 | 15.778 | 129.80 | 41.711 | 12.27 |
| 2048 | 512 | 8192 | 16.303 | 125.62 | 41.891 | 12.22 |
| 2048 | 512 | 10240 | 16.847 | 121.57 | 42.925 | 11.93 |
| 2048 | 512 | 12288 | 17.497 | 117.05 | 43.123 | 11.87 |
| 2048 | 512 | 14336 | 17.874 | 114.58 | 43.521 | 11.76 |

Q8_0 KV cache

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 2048 | 512 | 0 | 14.284 | 143.38 | 39.549 | 12.95 |
| 2048 | 512 | 2048 | 14.795 | 138.42 | 40.182 | 12.74 |
| 2048 | 512 | 4096 | 15.379 | 133.17 | 40.770 | 12.56 |
| 2048 | 512 | 6144 | 18.119 | 113.03 | 42.032 | 12.18 |
| 2048 | 512 | 8192 | 16.466 | 124.38 | 42.423 | 12.07 |
| 2048 | 512 | 10240 | 16.945 | 120.86 | 43.506 | 11.77 |
| 2048 | 512 | 12288 | 17.601 | 116.35 | 43.925 | 11.66 |
| 2048 | 512 | 14336 | 17.987 | 113.86 | 44.597 | 11.48 |

I.e., only very slightly slower than the fp16 KV cache. The KV cache is quite small with FlashMLA-3, but if one wants to go to 160k tokens with DeepSeek-V3/R1, using a Q8_0 KV cache instead of fp16 may make the difference between being able and not being able to run on a single 24 GB GPU.
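As a rough cross-check (not from the PR itself), the arithmetic below reproduces the KV self size reported in the log above: with MLA only c^KV is cached, i.e. kv_lora_rank + n_embd_head_qk_rope = 512 + 64 = 576 elements per token per layer, this model has 60 layers, fp16 takes 2 bytes per element, and Q8_0 takes 34 bytes per 32-element block. DeepSeek-V3/R1 has a few more layers, so its cache is proportionally larger.

```cpp
#include <cstdio>

int main() {
    const double elems_per_tok_layer = 512 + 64;    // kv_lora_rank + n_embd_head_qk_rope
    const int    n_layer             = 60;          // DeepSeek-V2.5
    const double bytes_f16           = 2.0;         // fp16: 2 bytes per element
    const double bytes_q8_0          = 34.0 / 32.0; // q8_0: 34-byte block per 32 elements

    const long ctxs[2] = {32768, 163840};
    for (long n_ctx : ctxs) {
        const double base = elems_per_tok_layer * n_layer * (double) n_ctx / (1024.0 * 1024.0);
        printf("n_ctx = %6ld: fp16 = %8.1f MiB, q8_0 = %7.1f MiB\n",
               n_ctx, base * bytes_f16, base * bytes_q8_0);
    }
    return 0;
}
// n_ctx =  32768: fp16 =   2160.0 MiB, q8_0 =  1147.5 MiB  (matches the 1147.50 MiB in the log)
// n_ctx = 163840: fp16 =  10800.0 MiB, q8_0 =  5737.5 MiB
```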

@JohannesGaessler

Thank you for notifying me. I am aware of the defect; on the mainline PR it is currently not manifesting as a bug because the K and V cache are not yet deduplicated and are thus both contiguous in memory. I can't comment on the specific code in this PR since I won't look at it unless you explicitly tell me I'm allowed to do so even without the conflict between you and Georgi first being resolved. The way I would have gone about it would have been not to use the V tensor at all, to dequantize K, and to then calculate the pointer, dimensions, and strides for a pseudo V tensor from the K tensor.
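For concreteness, a minimal sketch of what I have in mind, assuming K has already been dequantized once into a contiguous fp16 buffer with 576-element rows; the f16_view struct and function name are illustrative only, not ggml API.

```cpp
#include <cstdint>
#include <cstddef>

// Minimal fp16 view: ne[0] elements per row, ne[1] rows, nb[] byte strides.
struct f16_view {
    void *  data;
    int64_t ne[2];
    size_t  nb[2];
};

// Build a pseudo V tensor from an already dequantized, contiguous fp16 K buffer.
// For DeepSeek MLA: k_row_elems = 576 (512 + 64 rope), v_row_elems = 512.
static f16_view make_pseudo_v_from_k(void * k_f16, int64_t n_rows,
                                     int64_t k_row_elems, int64_t v_row_elems) {
    f16_view v;
    v.data  = k_f16;                          // same buffer, no copy
    v.ne[0] = v_row_elems;                    // only the first 512 elements of each row
    v.ne[1] = n_rows;
    v.nb[0] = sizeof(uint16_t);               // fp16 element stride
    v.nb[1] = k_row_elems * sizeof(uint16_t); // row stride is the full K row (576 * 2 bytes)
    return v;
}
```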

@ikawrakow
Owner Author

Forgot to add -rtr in the above performance test. Here it is with -rtr and q8_0 KV cache:

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 2048 | 512 | 0 | 13.348 | 153.43 | 36.662 | 13.97 |
| 2048 | 512 | 2048 | 14.637 | 139.92 | 37.208 | 13.76 |
| 2048 | 512 | 4096 | 14.478 | 141.46 | 37.720 | 13.57 |
| 2048 | 512 | 6144 | 14.880 | 137.64 | 39.034 | 13.12 |
| 2048 | 512 | 8192 | 16.081 | 127.36 | 39.282 | 13.03 |
| 2048 | 512 | 10240 | 16.240 | 126.11 | 40.409 | 12.67 |
| 2048 | 512 | 12288 | 17.001 | 120.47 | 40.805 | 12.55 |
| 2048 | 512 | 14336 | 18.056 | 113.42 | 41.437 | 12.36 |

@ikawrakow
Owner Author

> on the mainline PR it is currently not manifesting as a bug because the K and V cache are not yet deduplicated and are thus both contiguous in memory.

Oh, yes, I forgot about that.

In any case, the PR in ik_llama.cpp is mostly a copy of your mainline PR, so looking at the code you wrote in my repository hopefully does not break Georgi's rules.

@JohannesGaessler

My concern specifically is whether you would consider any of my work on mainline, after I have looked at your code, to include a "substantial portion" of your work and thus to be includable only in conjunction with the copyright notices in ik_llama.cpp. Much like you, I am not a lawyer, but if you tell me that you will not consider me looking at your work to be a license violation (or that in some specific case you waive the requirement of copyright notices), then there is no need for lawyers in the first place.
